Predict Runoff Statistics from Catchment Attributes#

Introduction#

We compute the (ordinary) statistics of runoff time series for a large sample of catchments. We then test the predictability of these statistics from catchment attributes using a gradient boosting machine learning approach. The purpose of testing predictability of the order statistics and L-moments is to fully describe parametric distributions for estimating streamflow distributions (flow duration curves), a common problem in applied hydrology.

Catchment attributes are used as predictors of each order statistic. Attributes are added cumulatively in successive tests to compare the contribution of catchment attribute groups related to climate, terrain, land cover, and soil.

We test if transforming the target variable has an effect on the xgboost model. For transformations that do not change the rank of the target variable, i would not expect to see an effect, however the metric used as the objective function may be sensitive to the distribution of target variables, that is sensitive to outliers. We test the predictability of the following:

Statistic

Description

<param>_mean

Mean of the <param> series (central tendency)

<param>_std

Standard deviation of the <param> series (spread)

<param>_mean

Mean of the log transformed <param> series (central tendency)

<param>_std

Standard deviation of the log transformed <param> series (spread)

Notes:

  • The <param> is the name of the runoff time series:

    • mm: unit area runoff in mm`.

    • uar: unit area runoff in \(L/s/\text{km}^2\).

    • logx: the logarithm of the unit area runoff is also used in computing each order statistics for prediction.

import os
import pandas as pd
import numpy as np
from pathlib import Path
import xarray as xr
from scipy.stats import skew, kurtosis
from math import comb

from bokeh.plotting import figure, show
from bokeh.layouts import gridplot, row, column
from bokeh.models import HoverTool, ColumnDataSource, Whisker, Legend, LegendItem
from bokeh.transform import factor_cmap, factor_mark
from bokeh.palettes import Category10, Category20, Category20c, Viridis
from bokeh.io import output_notebook
# from bokeh.palettes import Sunset10, Vibrant7

from sklearn.metrics import r2_score

import data_processing_functions as dpf

from scipy.stats import linregress
output_notebook()

BASE_DIR = os.getcwd()
Loading BokehJS ...

Load Input Data#

Extra catchments to exclude#

  • Kakuhan Creek Near Haines AK - 15056030

image.png

Excluded due to no complete years of data (seasonal / <= 90% complete)#

  • Genessee Creek at the Mouth - 08FA009

  • McNair Creek near Port Mellon - 08GA037

  • Canoe River near Valemount - 08NC003

  • Big Quilcene River Near Quilcene, WA - 12052500

  • Morey Creek above McChord Afb near Parkland, WA - 12090480

  • North Fork Newaukum Creek Near Enumclaw, WA - 12107950

  • Newaukum Creek Tributary Near Blacik Diamond, WA - 12108450

  • May Creek near Issaquah, WA - 12119300

  • Honey Creek near Renton, WA - 12119450

  • Carpenter Creek near Bacon Rod near Mount Vernon, WA - 12200684

  • Unnamed Tributary Massacre Bay on Orcas Island, WA - 12200762

  • Whatcom Creek near Bellingham, WA - 12203000

  • Hall Creek at Inchelium, WA - 12409500

  • Dayebas Creek Near Haines, AK - 15056070

  • Bonne Creek near Klawock, AK - 15081510

# extra stations to exclude
exclude = ['15056030', '08FA009', '08GA037', '08NC003', '12052500', '12090480', '12107950', '12108450', '12119300',
           '12119450', '12200684', '12200762', '12203000', '12409500', '15056070', '15081510']

# df = df[~df['official_id'].isin(exclude)]
# load the catchment characteristics
rev_date = '20250227'
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')

# load camels hydro attributes to compare with the BCUB data
camels_df = pd.read_csv('data/camels/camels_hydro.txt', sep=';')
camels_df['gauge_id'] = camels_df['gauge_id'].astype(str)
camels_df.head()

hs_df = pd.read_csv('data/HYSETS_watershed_properties.txt', sep=';', dtype={'Watershed_ID': int, 'Official_ID': str})

# create dictionaries for quick lookups
da_dict = {row['Official_ID']: row['Drainage_Area_km2'] for _, row in hs_df.iterrows()}
# needed to map between watershed ID in Hysets data and official_ID
watershed_id_dict = {row['Watershed_ID']: row['Official_ID'] for _, row in hs_df.iterrows()}
# and the inverse
official_id_dict = {row['Official_ID']: row['Watershed_ID'] for _, row in hs_df.iterrows()}
attribute_file = f'BCUB_watershed_attributes_updated_{rev_date}.csv'
updated_attribute_file = 'catchment_attributes_with_runoff_stats.csv'
if not os.path.exists(os.path.join('data', updated_attribute_file)):
    print(f'Updated attribute file {updated_attribute_file} not found. Using {attribute_file} instead.')
    updated_attribute_path = os.path.join('data', attribute_file)
    process_statistics = True
else:
    updated_attribute_path = os.path.join(os.getcwd(), 'data', updated_attribute_file)
    process_statistics = False

df = pd.read_csv(updated_attribute_path, dtype={'official_id': str})
print(len(df))
df = df[[c for c in df.columns if 'unnamed:' not in c.lower()]]
# exclude = ['15039900','15031000']
df.columns = [c.lower() for c in df.columns]
# assert '12414900' in df['official_id'].values
df.sort_values('official_id', inplace=True)
df.reset_index(drop=True, inplace=True)
1024
def load_and_filter_hysets_data(station_ids, hs_df):

    # load the updated HYSETS data
    updated_filename = 'HYSETS_2023_update_QC_stations.nc'
    ds = xr.open_dataset(HYSETS_DIR / updated_filename)

    # Get valid IDs as a NumPy array
    hs_df = hs_df[hs_df['Official_ID'].isin(station_ids)]
    selected_ids = hs_df['Watershed_ID'].values

    # Get boolean index where watershedID in selected_set
    # safely access watershedID as a variable first
    ws_ids = ds['watershedID'].data  # or .values if you prefer
    mask = np.isin(ws_ids, selected_ids)

    # Apply mask to data
    ds = ds.sel(watershed=mask)
    # Step 1: Promote 'watershedID' to a coordinate on the 'watershed' dimension
    ds = ds.assign_coords(watershedID=("watershed", ds["watershedID"].data))

    # Step 2: Set 'watershedID' as the index for the 'watershed' dimension
    return ds.set_index(watershed="watershedID")


def retrieve_timeseries_discharge(stn, ds):
    watershed_id = official_id_dict[stn]
    # drainage_area = self.ctx.da_dict[stn]
    # data = self.ctx.data
    da = da_dict[stn]
    df = ds['discharge'].sel(watershed=str(watershed_id)).to_dataframe(name='discharge').reset_index()
    df = df.set_index('time')[['discharge']]
    df.dropna(inplace=True)
    # clip minimum flow to 1e-4
    df['discharge'] = np.clip(df['discharge'], 1e-4, None)
    df.rename(columns={'discharge': stn}, inplace=True)
    df[f'{stn}_uar'] = 1000 * df[stn] / da
    df[f'{stn}_mm'] = df[stn] * (24 * 3.6 / da)
    df['log_x'] = np.log(df[f'{stn}_uar'])
    return df
ds = load_and_filter_hysets_data(df['official_id'], hs_df)

Filter the dataset for stations meeting minimum record length.

from lmoments3 import distr

def stats(values):
    for label in ['uar', 'log_uar']:
        if label.startswith('log_'):
            values = np.log(values)
        # classical moments
        m   = values.mean()
        s   = values.std(ddof=1)
        sk  = pd.Series(values).skew()
        kt  = pd.Series(values).kurtosis()
        # l-moments
        params = distr.gev.lmom_fit(values)
        return {
            f'{label}_mean': m,
            f'{label}_std': s,
            f'{label}_skew': sk,
            f'{label}_kurt': kt,
            f'{label}_lmom_xi': params['c'],
            f'{label}_lmom_loc': params['loc'],
            f'{label}_lmom_scale': params['scale'],
        }

# reset the index to ensure the split is done correctly
def process_row(data):
    stn = str(data['official_id'])
    data = retrieve_timeseries_discharge(stn, ds)
    # Compute the runoff statistics
    runoff_data = stats(data[f'{stn}_uar'].values)
    camels_data = camels_df[camels_df['gauge_id'] == stn].copy()
    if len(camels_data) > 1:
        camels_q = camels_data['q_mean'].values[0]
        raise Exception(f'Multiple CAMELS data found for {stn}.')
    else:
        camels_q = camels_data['q_mean'].values[0] if not camels_data.empty else np.nan

    # Merge your existing mm‐based mean + the new metrics
    out = {
      **runoff_data,
      'camels_q_mean_mm': camels_q,
    }
    return pd.Series(out)
# set the minimum number of complete years required
min_record_years = 5

filtered_df = df[df['n_complete_years'] >= min_record_years]

if process_statistics == True:
    print(f'Processing runoff statistics for {len(filtered_df)} stations')
    updated_fpath = os.path.join(os.getcwd(), 'data', f'catchment_attributes_with_runoff_stats.csv')
    stats_results = filtered_df.apply(lambda x: process_row(x), axis=1)
    target_cols = stats_results.columns.tolist()
    filtered_df.loc[stats_results.index, stats_results.columns] = stats_results
    print(f'   Saving updated attributes with runoff statistics for {len(filtered_df)} catchments to:', updated_fpath)
    filtered_df.to_csv(updated_fpath)
target_cols = [
    'mean_uar', 'sd_uar', 
    'mean_logx', 'sd_logx', 
]

# target_cols += [f'prob_q_lessthan_{threshold}' for threshold in [1e-4, 5e-4, 1e-3, 5e-3, 1e-2]]
assert np.all([c in df.columns for c in target_cols]), f"Not all target columns found in df: {target_cols} \n {list(df.columns)}"
from bokeh.models import ColorBar, LogColorMapper, ColumnDataSource

# visualize the distribution of the mean runoff
p1 = figure(title="Mean runoff distribution", width=500, height=300)
hist, edges = np.histogram(df['mean_uar'], density=True, bins=50)
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color='white')
p1.xaxis.axis_label = 'Mean runoff [L/s/km^2]'
p1.yaxis.axis_label = 'Frequency'

# visualize a scatter plot of the mean and standard deviation
p2 = figure(title=f"Mean vs. Standard deviation (N={len(df)})", width=500, height=300)
# add color mapper to encode drainage_area_km2
mapper = LogColorMapper(palette='Viridis256', low=df['drainage_area_km2'].min(), 
                        high=df['drainage_area_km2'].max())
source = ColumnDataSource(df)
color_bar = ColorBar(color_mapper=mapper, width=8, location=(0,0), title='Drainage area [km²]',)
p2.add_layout(color_bar, 'right')
# add the scatter plot with color mapping
p2.scatter('mean_uar', 'sd_uar', source=source, color={'field': 'drainage_area_km2', 'transform': mapper})

x = np.linspace(0, df['mean_uar'].max(), 100)
slope, intercept, r, pval, se = linregress(df['mean_uar'].values, df['sd_uar'].values)
y = [slope*e + intercept for e in x]
p2.line(x, y, legend_label=f'L2: y={slope:.2f}x + {intercept:.2f} (R²={r**2:.2f})', 
       line_width=2, color='red', line_dash='dashed')

# plot the L1 Norm -- L2 is sensitive to outliers and is biased for low mean values
# l1_slope, l1_intercept = L1_fit_line(df['mean_uar'].values, df['sd_uar'].values)
# yl1 = [l1_slope*e + l1_intercept for e in x]
# p2.line(x, yl1, legend_label=f'L1: y={l1_slope:.2f}x + {l1_intercept:.2f}', 
#        line_width=2, color='red', line_dash='dotted')
p2.line([0, df['mean_uar'].max()], [0, df['mean_uar'].max()], 
        line_width=2, color='black', line_dash='dotted', legend_label='1:1')

# add hover tool to show official_id
p2.add_tools(HoverTool(tooltips=[('Official ID', '@official_id')]))
# p2.xaxis.axis_label = 'Mean runoff [mm/day]'
p2.xaxis.axis_label = 'Mean runoff [L/s/km^2]'
p2.yaxis.axis_label = 'Standard deviation'
p2.legend.location = 'top_left'
p2.legend.background_fill_alpha = 0.5
p2.legend.click_policy = 'hide'
show(row(p1, p2))

Define attribute groups#

terrain = ['drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg']
land_cover = [
    'land_use_forest_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_water_frac_2010', 
    'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010']
climate = ['prcp', 'srad', 'swe', 'tmax', 'tmin', 'vp', 'high_prcp_freq', 'high_prcp_duration', 'low_prcp_freq', 'low_prcp_duration']
soil = ['logk_ice_x100', 'porosity_x100']

all_attributes = terrain + land_cover + soil + climate
assert len([c for c in all_attributes if c not in df.columns]) == 0
len(all_attributes)
24
results_folder = os.path.join(BASE_DIR, 'data', 'results', 'parameter_prediction_results')
if not os.path.exists(results_folder):
    os.makedirs(results_folder)
nfolds = 5
n_boost_rounds = 500
n_optimization_rounds = 20
loss = 'reg:squarederror'
# loss = 'reg:absoluteerror'

all_test_results = {}
attribute_set_dict = {
    'climate': climate, 
    '+land_cover': land_cover,
    '+terrain': terrain, 
    '+soil': soil,
}

Set Attribute Groupings#

group_1 = ['climate', '+terrain', '+land_cover', '+soil']
# for group 2, just reverse group 1
group_2 = group_1[::-1]
group_3 = ['+land_cover', '+terrain', '+soil', 'climate']
group_4 = ['+soil', 'climate', '+land_cover', '+terrain']
attribute_group_orders = [group_1, group_2, group_3, group_4]

Run XGBoost Models#

Separate the test set at the outset so the attribute group ordering is tested on the same hold-out set but necessarily on unique training optimizations. This ensures that at least the presence of outliers in the hold-out set should at least be constant across the attribute group reordering.

from xgb_functions import run_xgb_CV_trials

def predict_runoff_from_attributes(df, target_column, group_order, results_folder, n_boost_rounds, n_optimization_rounds, loss):
    """
    Note that here we're predicting the log mean unit area runoff
    """
    # set the target column
    predictor_attributes = []
    results = {}
    # add attribute groups successively
    for set_name in group_order:
        print(f' Processing {set_name} attribute set')
        predictor_attributes += attribute_set_dict[set_name] 
        input_data = df[['official_id'] + predictor_attributes + [target_column]].copy()
        result_df, all_predictions_df, all_convergence_df = run_xgb_CV_trials(
            set_name, predictor_attributes, target_column, input_data, 
            n_optimization_rounds, nfolds, n_boost_rounds, results_folder, 
            loss=loss,
        )
        # store the test set predictions and actuals
        results[set_name] = {
            'all_results': result_df,
            'convergence': all_convergence_df,
            'oos_predictions': all_predictions_df,
        } 
    return results
results_dict = {}
group_results_dict = {}
eval_metrics = {'reg:squarederror': 'test_rmse', 'reg:absoluteerror': 'test_mae'}
for target_col in target_cols:
    n  = 0
    min_error = 1e9    
    best_set = None
    results_dict[target_col] = {}
    group_results_dict[target_col] = {}
    print(f'TARGET: {target_col}')
    eval_metric = eval_metrics[loss]
    
    for group in attribute_group_orders:
        n += 1
        print(f'Processing: {group} ordering. {n}/{len(attribute_group_orders)}.')
        
        group_results_fname = f'{target_col}_prediction_results_{"".join(group)}.npy'
        group_results_fpath = os.path.join(results_folder, group_results_fname)

        print(f'Group results file: {group_results_fpath}')
        
        if os.path.exists(group_results_fpath):
            print(f'Retrieving existing results from {group_results_fpath}')
            group_results = np.load(group_results_fpath, allow_pickle=True).item()
        else:
            group_results = predict_runoff_from_attributes(df, target_col, group, results_folder, n_boost_rounds, n_optimization_rounds, loss)
            np.save(group_results_fpath, group_results)
    
        group_results_dict[target_col][n] = {'order': group, 'results': group_results}
        for k, test_data in group_results.items():
            # Save all results, not just the best
            results_dict[target_col][f'group_{n}'] = {
                'group_order': group,
                'test_error': test_data['all_results'][f'{eval_metric}_mean'],
                'test_error_std': test_data['all_results'][f'{eval_metric}_stdev'],
                'convergence': test_data['convergence'],
                'oos_predictions': test_data['oos_predictions'],
            }
            # print(f'Group {n} {k} {eval_metric}: {results_dict[target_col][f"group_{n}"]["test_error"]}')
            summary_csv = f'{target_col}_summary_group_{n}.csv'
            test_data['oos_predictions'].to_csv(os.path.join(results_folder, summary_csv), index=False)
TARGET: mean_uar
Processing: ['climate', '+terrain', '+land_cover', '+soil'] ordering. 1/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_uar_prediction_results_climate+terrain+land_cover+soil.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_uar_prediction_results_climate+terrain+land_cover+soil.npy
Processing: ['+soil', '+land_cover', '+terrain', 'climate'] ordering. 2/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_uar_prediction_results_+soil+land_cover+terrainclimate.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_uar_prediction_results_+soil+land_cover+terrainclimate.npy
Processing: ['+land_cover', '+terrain', '+soil', 'climate'] ordering. 3/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_uar_prediction_results_+land_cover+terrain+soilclimate.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_uar_prediction_results_+land_cover+terrain+soilclimate.npy
Processing: ['+soil', 'climate', '+land_cover', '+terrain'] ordering. 4/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_uar_prediction_results_+soilclimate+land_cover+terrain.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_uar_prediction_results_+soilclimate+land_cover+terrain.npy
TARGET: sd_uar
Processing: ['climate', '+terrain', '+land_cover', '+soil'] ordering. 1/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_uar_prediction_results_climate+terrain+land_cover+soil.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_uar_prediction_results_climate+terrain+land_cover+soil.npy
Processing: ['+soil', '+land_cover', '+terrain', 'climate'] ordering. 2/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_uar_prediction_results_+soil+land_cover+terrainclimate.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_uar_prediction_results_+soil+land_cover+terrainclimate.npy
Processing: ['+land_cover', '+terrain', '+soil', 'climate'] ordering. 3/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_uar_prediction_results_+land_cover+terrain+soilclimate.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_uar_prediction_results_+land_cover+terrain+soilclimate.npy
Processing: ['+soil', 'climate', '+land_cover', '+terrain'] ordering. 4/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_uar_prediction_results_+soilclimate+land_cover+terrain.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_uar_prediction_results_+soilclimate+land_cover+terrain.npy
TARGET: mean_logx
Processing: ['climate', '+terrain', '+land_cover', '+soil'] ordering. 1/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_logx_prediction_results_climate+terrain+land_cover+soil.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_logx_prediction_results_climate+terrain+land_cover+soil.npy
Processing: ['+soil', '+land_cover', '+terrain', 'climate'] ordering. 2/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_logx_prediction_results_+soil+land_cover+terrainclimate.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_logx_prediction_results_+soil+land_cover+terrainclimate.npy
Processing: ['+land_cover', '+terrain', '+soil', 'climate'] ordering. 3/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_logx_prediction_results_+land_cover+terrain+soilclimate.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_logx_prediction_results_+land_cover+terrain+soilclimate.npy
Processing: ['+soil', 'climate', '+land_cover', '+terrain'] ordering. 4/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_logx_prediction_results_+soilclimate+land_cover+terrain.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/mean_logx_prediction_results_+soilclimate+land_cover+terrain.npy
TARGET: sd_logx
Processing: ['climate', '+terrain', '+land_cover', '+soil'] ordering. 1/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_logx_prediction_results_climate+terrain+land_cover+soil.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_logx_prediction_results_climate+terrain+land_cover+soil.npy
Processing: ['+soil', '+land_cover', '+terrain', 'climate'] ordering. 2/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_logx_prediction_results_+soil+land_cover+terrainclimate.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_logx_prediction_results_+soil+land_cover+terrainclimate.npy
Processing: ['+land_cover', '+terrain', '+soil', 'climate'] ordering. 3/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_logx_prediction_results_+land_cover+terrain+soilclimate.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_logx_prediction_results_+land_cover+terrain+soilclimate.npy
Processing: ['+soil', 'climate', '+land_cover', '+terrain'] ordering. 4/4.
Group results file: /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_logx_prediction_results_+soilclimate+land_cover+terrain.npy
Retrieving existing results from /home/danbot/code/distribution_estimation/docs/notebooks/data/results/parameter_prediction_results/sd_logx_prediction_results_+soilclimate+land_cover+terrain.npy
label_dict = {
    'mean_uar': {'x': r'$$\mu_i \left[ L s^{-1} \text{km}^{-2}\right]$$', 'y': r'$$\hat \mu_i \left[L s^{-1} \text{km}^{-2} \right]$$'}, 
    'sd_uar': {'x': r'$$\sigma_i \left[ L s^{-1} \text{km}^{-2}\right]$$', 'y': r'$$\hat \sigma_i \left[ L s^{-1} \text{km}^{-2} \right]$$'}, 
    'mean_logx': {'x': r'$$\text{\log \mu} \left[ L s^{-1} \text{km}^{-2} \right]$$', 'y': r'$$ \text{\log \hat \mu}  \left[ L s^{-1} \text{km}^{-2} \right]$$'}, 
    'sd_logx': {'x': r'$$\log\ sigma} \left[ L s^{-1} \text{km}^{-2} \right]$$', 'y': r'$$\text{\log \hat \sigma}  \left[ L s^{-1} \text{km}^{-2} \right]$$'}, 
}

target_cols = [
    'mean_uar', 'sd_uar', 
    'mean_logx', 'sd_logx',
]

View Results#

def create_results_plots(target_col, results_df, attribute_sets, eval_metric, title=''):    
    plots = []
    def _plot_attribute_order_line():
        means, lbs, ubs = [], [], []

        for e in attribute_sets:
            df = results_df[e]['all_results']
            trial_means = df[f'{eval_metric}_mean'].values
            means.append(np.mean(trial_means))
            lb, ub = np.percentile(trial_means, [2.5, 97.5])
            lbs.append(lb)
            ubs.append(ub)

        max_range = (0.95*min(lbs), max(ubs)*1.05)
        group_order = [e if not e.startswith('+') else e[1:] for e in attribute_sets]
        group_order = group_order[:1] + [f'+{e}' for e in group_order[1:]]  # add '+' to all but the first element
        source = ColumnDataSource({'x': group_order, 'y1': means, 'ub': ubs, 'lb': lbs})
    
        fig = figure(title='', x_range=group_order, y_range=max_range)#y_range=max_range)
        fig.scatter('x', 'y1', legend_label=eval_metric.upper(), color='green', source=source, line_width=3, marker='square', size=4)
        fig.add_layout(Whisker(source=source, base='x', upper='ub', lower='lb', line_width=1))
        fig.legend.background_fill_alpha = 0.6
        fig.yaxis.axis_label = r'$$\text{RMSE}$$'
        fig.xaxis.axis_label = r'$$\text{Attribute Groups}$$'
        best_set = min(attribute_sets, key=lambda x: results_df[x]['all_results'][f'{eval_metric}_mean'].mean())
        return fig, best_set


    def _plot_scatter_with_regression(best_result_df, xlabel, ylabel, target_col):
        trial_r2 = (
            best_result_df.groupby('trial')[['actual', 'predicted']]
            .apply(lambda df: r2_score(df['actual'], df['predicted']))
        )
        # r2_mean = trial_r2.mean()
        r2_std = trial_r2.std()
        grouped = best_result_df.groupby('trial').agg({
            'actual': 'median',
            'predicted': 'median',
        })
        grouped['diff'] = (grouped['actual'] - grouped['predicted']).abs()
        median_trial = grouped.sort_values('diff').index[len(grouped) // 2]
        best_result = best_result_df[best_result_df['trial'] == median_trial].copy()
        xx, yy = best_result['actual'].values, best_result['predicted'].values
        source = ColumnDataSource({'x': xx, 'y': yy, 'ID': best_result['official_id'].values})
        slope, intercept, r, p, se = linregress(xx, yy)

        sfig = figure(title='')
        sfig.scatter('x', 'y', size=3, alpha=0.8, source=source, legend_label=target_col, color='blue')
        sfig.add_tools(HoverTool(tooltips=[('ID', '@ID')]))
        x_obs = np.linspace(min(xx), max(xx), 1000)
        ybf = [slope * e + intercept for e in x_obs]
        sfig.line(x_obs, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f} ± {r2_std:.3f}')
        sfig.line([0, max(ybf)], [0, max(ybf)], color='black', line_dash='dotted', line_width=2, legend_label='1:1')
        sfig.xaxis.axis_label = xlabel
        sfig.yaxis.axis_label = ylabel
        sfig.legend.background_fill_alpha = 0.5
        sfig.legend.location = 'top_left'
        return sfig, median_trial

    def _plot_convergence(convergence_df, median_trial):
        cfig = figure(title='')

        data = convergence_df[convergence_df['trial'] == median_trial].copy()
        fold_nos = sorted(set(convergence_df['fold']))

        min_pred_risk = 1e9
        for fn in fold_nos:
            fold_data = data[data['fold'] == fn].copy()
            cfig.line(fold_data['round'], fold_data[f'test'], line_alpha=0.6, line_color='red', line_dash='dotted', legend_label=f'Test Folds')
            cfig.line(fold_data['round'], fold_data[f'train'], line_alpha=0.7, line_color='grey', line_dash='dotted', legend_label=f'Train Folds')

        train_pivot = data.pivot_table(index='round', columns='fold', values='train')#, aggfunc='first')
        test_pivot = data.pivot_table(index='round', columns='fold', values='test')#, aggfunc='first')
        train_pivot.columns = [f'fold_{col}' for col in train_pivot.columns]
        test_pivot.columns = [f'fold_{col}' for col in test_pivot.columns]

        train_pivot['mean'] = train_pivot.mean(axis=1)
        test_pivot['mean'] = test_pivot.mean(axis=1)
        cfig.line(train_pivot.index, train_pivot['mean'], line_alpha=0.5, line_color='grey', line_width=2, legend_label='CV Mean (Train)')
        cfig.line(test_pivot.index, test_pivot['mean'], line_alpha=0.5, line_color='red', line_width=2, legend_label='CV Mean (Test)')

        min_pred_risk_idx = test_pivot['mean'].idxmin()
        min_pred_risk = test_pivot.loc[min_pred_risk_idx, 'mean']
        if min_pred_risk_idx == max(test_pivot['mean'].index):
            print(f'Min prediction risk occurs at the maximum iteration, try increasing the number of boosting rounds')
        cfig.line([min_pred_risk_idx, min_pred_risk_idx], [0, min_pred_risk], legend_label='Min risk', color='green', line_width=2, line_dash='dashed')
        cfig.legend.location = 'top_right'
        cfig.legend.background_fill_alpha = 0.5
        cfig.xaxis.axis_label = r'$$\text{Iteration}$$'
        cfig.yaxis.axis_label = r'$$\text{RMSE}$$'
        return cfig

    def _plot_target_cdfs(cdf_arrays, xlabel):
        cdffig = figure(title='', x_axis_type='log')
        for (cdfx, cdfy) in cdf_arrays:
            cdffig.line(cdfx, cdfy, color='black', line_alpha=0.6, line_width=2, legend_label='Fold CDFs')
        cdffig.xaxis.axis_label = xlabel
        cdffig.yaxis.axis_label = r'$$\text{Pr}(X\leq x) $$'
        cdffig.legend.location = 'top_left'
        return cdffig
    
    def _compute_empirical_cdf(data):
        sorted_data = np.sort(data)
        cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
        return sorted_data, cdf

    xlabel = label_dict[target_col]['x']
    ylabel = label_dict[target_col]['y']

    # Plot 1: RMSE/MAE across attribute sets
    attr_fig, best_attr_set = _plot_attribute_order_line()
    # attr_fig = _plot_attribute_order_cdf()
    plots.append(attr_fig)

    # Plot 2: Actual vs Predicted for best set
    best_attr = [e for e in results_df.keys() if e.endswith(best_attr_set)][0]
    best_result = results_df[best_attr]['oos_predictions']
    sfig, median_trial = _plot_scatter_with_regression(best_result, xlabel, ylabel, target_col)
    plots.append(sfig)

    # Plot 3: Convergence plot
    convergence_df = results_df[best_attr]['convergence']
    cfig = _plot_convergence(convergence_df, median_trial)
    plots.append(cfig)

    # Plot 4: CDFs
    cdf_arrays = []
    for i, grp_result in results_df[best_attr]['oos_predictions'].groupby('fold'):
        sorted_data, cdf = _compute_empirical_cdf(grp_result['actual'].values)
        cdf_arrays.append((sorted_data, cdf))
    cdffig = _plot_target_cdfs(cdf_arrays, xlabel)
    plots.append(cdffig)

    return plots

In the sequence of plots below, we change the order that groups of attributes are added to training.

We split the CF folds using a stratified approach to balance the target variable distribution in each fold (right-most plot).

The plot 3rd from left shows the objective score at each iteration to show how the training and test sets compare. We don’t want to continue training when the test sets are not improving.

target_ranges = {'mean_uar': (14, 35), 'sd_uar': (20, 45),
                'mean_logx': (0.45, 1.25), 'sd_logx': (0.4, 0.5)}
target_ranges = {'mean_uar': (14, 35), 'sd_uar': (20, 45),
                'mean_logx': (0.45, 1.25), 'sd_logx': (0.4, 0.5)}
n = 4
all_plots = []
for c in target_cols:
    data = group_results_dict[c][n]
    eval_metric = eval_metrics[loss]
    grp_plots = create_results_plots(c, data['results'], data['order'], eval_metric)
    all_plots += grp_plots

layout = gridplot(all_plots, ncols=4, width=300, height=275)
show(layout)
n = 3
all_plots = []
for c in target_cols:
    data = group_results_dict[c][n]
    eval_metric = eval_metrics[loss]
    grp_plots = create_results_plots(c, data['results'], data['order'], eval_metric)
    all_plots += grp_plots

layout = gridplot(all_plots, ncols=4, width=300, height=275)
show(layout)
n = 2
all_plots = []
for c in target_cols:
    data = group_results_dict[c][n]
    eval_metric = eval_metrics[loss]
    grp_plots = create_results_plots(c, data['results'], data['order'], eval_metric)
    all_plots += grp_plots

layout = gridplot(all_plots, ncols=4, width=300, height=275)
show(layout)
Min prediction risk occurs at the maximum iteration, try increasing the number of boosting rounds
n = 1
all_plots = []
for c in target_cols:
    data = group_results_dict[c][n]
    eval_metric = eval_metrics[loss]
    grp_plots = create_results_plots(c, data['results'], data['order'], eval_metric)
    all_plots += grp_plots

layout = gridplot(all_plots, ncols=4, width=300, height=275)
show(layout)

Test the sensitivity to Order of attribute groups#

Note that in the first plot (at left) the attribute groups are additive. That is, the GBM is trained on just the first group listed in the categorical x-axis, then the second group attributes are added to the first, then the third group is added to the first and second, and so forth.

Test randomly permuted target values#

As a last iteration, randomize the order of the mean_runoff column to test what the algorithm is learning.

The predictive power decreases substantially across all groupings of input attributes.

random_results = {}
group_order = group_1
for tc in target_cols:    
    print(f'Processing randomization for target column: {tc}')
    test_results_fname = f'{tc}_prediction_results_shuffled.npy'
    test_results_fpath = os.path.join(results_folder, test_results_fname)
    
    if os.path.exists(test_results_fpath):
        shuffled_test_results = np.load(test_results_fpath, allow_pickle=True).item()
    else:
        shuffled_df = df.copy()
        for attr in all_attributes:
            # randomly shuffle the order of attribute values
            attr_values = df[attr].values
            np.random.shuffle(attr_values)
            shuffled_df[attr] = attr_values
        
        shuffled_test_results = predict_runoff_from_attributes(shuffled_df, tc, group_order, results_folder, n_boost_rounds, n_optimization_rounds, loss)
        np.save(test_results_fpath, shuffled_test_results)

    random_results[tc] = {'order': group_order, 'results': shuffled_test_results}
Processing randomization for target column: mean_uar
Processing randomization for target column: sd_uar
Processing randomization for target column: mean_logx
Processing randomization for target column: sd_logx

View results of shuffled target variable (mean runoff)#

all_shuffled_plots = []
for tc in target_cols:
    shuffled_results = random_results[tc]['results'].copy()
    group_order = random_results[tc]['order']
    shuffled_runoff_plots = create_results_plots(tc, shuffled_results, group_order, eval_metric, title='')
    all_shuffled_plots += shuffled_runoff_plots

layout = gridplot(all_shuffled_plots, ncols=4, width=300, height=275)
show(layout)
# process the results into a dictionary object for easy access to predicted values
mean_predictions = []
for tc in target_cols:
    group_results = group_results_dict[tc]
    group, group_data = 1, group_results[1]
    eval_group = '+land_cover' # evaluate on climate + terrain + land_cover set (neglect soil)
    data = group_data['results'][eval_group]
    oos_predictions = (
        data['oos_predictions']
        .groupby('official_id')['predicted']
        .mean()
        .to_frame()
    )
    # make a dict of official_id: actual from data['oos_predictions']
    actual_dict = data['oos_predictions'].set_index('official_id')['actual'].to_dict()
    # store the predictions in a dataframe
    oos_predictions.rename(columns={'predicted': f'{tc}_mean_predicted'}, inplace=True)
    oos_predictions = oos_predictions[[f'{tc}_mean_predicted']]
    oos_predictions[f'{tc}_actual'] = oos_predictions.index.map(actual_dict)
    mean_predictions.append(oos_predictions)

mean_predictions_df = pd.concat(mean_predictions, axis=1)
mean_predictions_df.reset_index(inplace=True)
mean_predictions_df.to_csv(os.path.join(results_folder, 'mean_parameter_predictions.csv'), index=False)

Discussion#

  • Reordering the attribute groupings suggests there are interactions between attributes in model training.

  • Across all orderings, the climate attributes provide most predictive information,

  • Soil attributes contribute little or no explanatory power to the model.

  • Terrain and land cover attributes provide some predictive information over soil, but the joint entropy with climate seems to represent much of this information gain,

  • Randomly permuting the order of the target variable, mean_runoff erases all predictive power.

Citations#